Let's start off by providing some background information about this topic, defining the objectives of this project, and configuring some things.
The National Basketball Association (NBA) is a professional basketball league in the United States. There are $30$ teams in the league, divided evenly into $2$ conferences: the Eastern Conference and the Western Conference.
In the regular season, each team plays $82$ games. NBA regular season standings are determined by teams' win-loss records within their conferences.
The top $8$ teams from each conference advance to the playoffs. In the event of a tie in the standings, there is a tie-breaking procedure used to determine playoff seeding.
Starting in the $2019\text{-}20$ season, the NBA added a play-in tournament to give the $9^\text{th}$ and $10^\text{th}$ place teams in each conference the opportunity to earn a spot in the playoffs. In its current form, the $7^\text{th}$ through $10^\text{th}$ place teams in each conference compete for the final $2$ playoff seeds: the winner of the $7$ vs. $8$ game claims the $7^\text{th}$ seed, while the loser of that game faces the winner of the $9$ vs. $10$ game for the $8^\text{th}$ seed.
Once the final playoff seeding is determined, each team plays an opponent in a best-of-$7$ series. The first to win $4$ games advances to the next round. The first round is followed by the conference semifinals, then the conference finals, then the finals. The team that wins the NBA Finals is the NBA Champion.
The matchups for each round are determined using a traditional bracket structure, shown below:

We want to perform some analysis to see if we can identify factors underlying teams' level of success in the playoffs. Our ultimate goal will be to predict the outcome of the NBA playoffs using data from the regular season.
Can we accurately predict how many playoff games a team will win?
With this information, we could determine how likely a team is to make a deep playoff run or even win a championship.
These are some of the questions we want to answer as we go through the full data science pipeline.
We'll start by importing the Python libraries necessary for this project and configuring some things.
# import libraries
import warnings
import time
from pathlib import Path
import itertools
import textwrap
import requests
from bs4 import BeautifulSoup, Comment, MarkupResemblesLocatorWarning
import pandas as pd
import numpy as np
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.linear_model import LinearRegression
# warnings
warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning, module="bs4")
# requests
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
}
# pandas
pd.set_option("display.max_columns", None)
# plotly
pio.renderers.default = "notebook+plotly_mimetype+png+jpeg+svg+pdf"
Now, we need to collect data that we can use in our analysis.
We will scrape data from Basketball Reference, a site that provides historical basketball data.
We will use data from the $2002\text{-}03$ season (when the NBA switched to the current playoff format, where every series is best-of-$7$) to the $2022\text{-}23$ season (the current season). For each season, we will scrape the following information:

- regular season per-game stats
- regular season advanced stats
- expanded regular season standings
- playoff advanced stats
For convenience, we will define a function called pages_to_scrape. Given a season, it will return a dictionary which maps the URL of the page we will be scraping to the list of information about each table we will be scraping from that page.
Each element in the list will be in the form of a dictionary with $2$ items:
- id of the HTML table element we will be scraping from the page
- path where we will be storing the table as a CSV

def pages_to_scrape(season):
    return {
        f"https://www.basketball-reference.com/leagues/NBA_{season}.html": [
            {
                "id": "per_game-team",
                "path": f"data/raw/{season}/regular_season/per_game_stats.csv",
            },
            {
                "id": "advanced-team",
                "path": f"data/raw/{season}/regular_season/advanced_stats.csv",
            },
        ],
        f"https://www.basketball-reference.com/leagues/NBA_{season}_standings.html": [
            {
                "id": "expanded_standings",
                "path": f"data/raw/{season}/regular_season/standings.csv",
            }
        ],
        f"https://www.basketball-reference.com/playoffs/NBA_{season}.html": [
            {
                "id": "advanced-team",
                "path": f"data/raw/{season}/playoffs/advanced_stats.csv",
            }
        ],
    }
To scrape the data from each page, we will do the following:
1. Send an HTTP GET request to the appropriate URL using the Requests library.
2. Find the HTML table with the appropriate id.
3. Convert the HTML table into a DataFrame using pandas.
4. Ensure the path exists in the filesystem using pathlib.
5. Save the DataFrame to a CSV file at the appropriate path using pandas.

In our approach, there are a few issues that we have to address as well:

- If the CSV files for a page already exist, we will not re-scrape the page.
- Some table elements are hidden inside of HTML comments, so we have to look there if a table can't be found normally.
- To avoid overloading the site, we sleep between each HTTP GET request using the time library.

# get list of seasons
seasons = list(range(2003, 2023 + 1))
# go through seasons
for season in seasons:
    # get pages to scrape for season
    pages = pages_to_scrape(season)
    # go through pages
    for url, infoList in pages.items():
        # skip if all CSV files exist
        if all([Path(info["path"]).exists() for info in infoList]):
            continue
        # request data from page
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.content, "html.parser")
        # go through tables
        for info in infoList:
            # skip if CSV file exists
            if Path(info["path"]).exists():
                continue
            # find table
            table = soup.find("table", id=info["id"])
            if table is None:
                for comment in soup.find_all(
                    string=lambda text: isinstance(text, Comment)
                ):
                    comment_soup = BeautifulSoup(comment, "html.parser")
                    table = comment_soup.find("table", id=info["id"])
                    if table is not None:
                        break
            # convert table to DataFrame and save as CSV
            df = pd.read_html(str(table))[0]
            Path(info["path"]).parent.mkdir(parents=True, exist_ok=True)
            df.to_csv(info["path"], index=False)
        # sleep for 10 seconds before next request
        time.sleep(10)
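The comment-parsing fallback above deserves a closer look. Here is a minimal, self-contained sketch of the same trick, using a made-up HTML snippet (not a real Basketball Reference page) to show why a normal search fails and how searching inside comment nodes succeeds:

```python
from bs4 import BeautifulSoup, Comment

# a toy page whose table is hidden inside an HTML comment,
# mimicking how Basketball Reference serves some tables
html = """
<div id="all_advanced-team">
<!--
<table id="advanced-team"><tr><th>Team</th></tr></table>
-->
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# a normal search fails because the table is commented out
table = soup.find("table", id="advanced-team")
assert table is None

# search inside each comment node instead
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment_soup = BeautifulSoup(comment, "html.parser")
    table = comment_soup.find("table", id="advanced-team")
    if table is not None:
        break

print(table["id"])  # advanced-team
```

Basketball Reference does this so the tables render lazily; parsing each `Comment` node as its own document recovers them.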
Now that we have collected all the data, we need to process it and make it suitable for analysis. We will be making extensive use of pandas for manipulating DataFrames.
The first step is to figure out how we will be cleaning all the data.
First, let's define a function called rename_column. It will take in a column name and return a column name that has been modified to provide more consistency.
The renaming rules are as follows:
- If the name contains Unnamed, then we will replace it with a blank string.
- If the name is Rk, then we will replace it with Rank.
- If the name is Tm, then we will replace it with Team.
- If the name is Offense Four Factors (which is the name for a group of $4$ different columns), then we will replace it with a blank string. This way, the sub-columns will be assumed to be referring to the team's statistics on offense.
- If the name is Defense Four Factors (which is the name for a group of $4$ different columns), then we will replace it with Opp. This way, the sub-columns will be assumed to be referring to the opponent's statistics on offense (the team's statistics on defense).

def rename_column(name):
    if "Unnamed" in name:
        return ""
    elif name == "Rk":
        return "Rank"
    elif name == "Tm":
        return "Team"
    elif name == "Offense Four Factors":
        return ""
    elif name == "Defense Four Factors":
        return "Opp"
    else:
        return name
Now, let's define a function called rename_columns. It takes in a DataFrame and a list of the header rows.
It works as follows:
- If there is $1$ header row, it renames each column using rename_column.
- If there are $2$ header rows, it renames both parts of each column name using rename_column and combines them.

def rename_columns(df, header):
    if len(header) == 1:
        df.columns = [rename_column(i) for i in df.columns]
    else:
        df.columns = [f"{rename_column(i)}{rename_column(j)}" for i, j in df.columns]
    return df
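To see what the two-header-row case does, here is a small self-contained example on toy column names (the renaming rules are restated inline in abbreviated form so the snippet stands alone):

```python
import pandas as pd

# toy DataFrame with a two-row (MultiIndex) header,
# like the advanced stats tables on Basketball Reference
df = pd.DataFrame(
    [[1, 2, 3]],
    columns=pd.MultiIndex.from_tuples(
        [
            ("Unnamed: 0", "Tm"),
            ("Offense Four Factors", "eFG%"),
            ("Defense Four Factors", "eFG%"),
        ]
    ),
)

def rename_column(name):
    # abbreviated version of the renaming rules above
    if "Unnamed" in name or name == "Offense Four Factors":
        return ""
    if name == "Tm":
        return "Team"
    if name == "Defense Four Factors":
        return "Opp"
    return name

# combine the two header levels into a single flat name
df.columns = [f"{rename_column(i)}{rename_column(j)}" for i, j in df.columns]
print(list(df.columns))  # ['Team', 'eFG%', 'OppeFG%']
```

Note how the blank replacements make the group-level header disappear, while Opp becomes a prefix that distinguishes the defensive copy of eFG%.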
Now, let's create a dictionary called files_to_clean with information about the CSV files we need to clean. It will map the filename to a dictionary with $3$ items:

- the header of the CSV file (an array of the row indices for the header)
- the columns of the CSV file that we want to keep for now (after they have been renamed using the rename_columns function above)
- the column_mappings, which map old column names to new column names (for renaming purposes)

files_to_clean = {
"regular_season/per_game_stats.csv": {
"header": [0],
"columns": [
"Team",
"FG",
"FGA",
"FG%",
"3P",
"3PA",
"3P%",
"2P",
"2PA",
"2P%",
"FT",
"FTA",
"FT%",
"ORB",
"DRB",
"TRB",
"AST",
"STL",
"BLK",
"TOV",
"PF",
"PTS",
],
"column_mappings": {},
},
"regular_season/advanced_stats.csv": {
"header": [0, 1],
"columns": [
"Team",
"SRS",
"ORtg",
"DRtg",
"NRtg",
"Pace",
"FTr",
"3PAr",
"TS%",
"eFG%",
"TOV%",
"ORB%",
"FT/FGA",
"OppeFG%",
"OppTOV%",
"OppDRB%",
"OppFT/FGA",
],
"column_mappings": {"OppDRB%": "DRB%"},
},
"regular_season/standings.csv": {
"header": [0, 1],
"columns": [
"Rank",
"Team",
"Overall",
"PlaceHome",
"PlaceRoad",
"ConferenceE",
"ConferenceW",
],
"column_mappings": {
"Overall": "OverallRecord",
"PlaceHome": "HomeRecord",
"PlaceRoad": "RoadRecord",
"ConferenceE": "EastRecord",
"ConferenceW": "WestRecord",
},
},
"playoffs/advanced_stats.csv": {
"header": [0, 1],
"columns": ["Rank", "Team", "W", "L"],
"column_mappings": {"Rank": "PlayoffRank", "W": "PlayoffW", "L": "PlayoffL"},
},
}
Now, let's define a function called clean_csv. Given the path to a CSV file, it will return a DataFrame with a cleaned version of the data from the CSV file.
To clean a file, we do the following:
1. Using the files_to_clean dictionary, obtain the information needed to clean the CSV file at the specified path.
2. Read the CSV file at the specified path into a DataFrame using pandas.
3. Rename the columns using the rename_columns function.
4. Remove unnecessary columns from the DataFrame.
5. Remove the row where Team is League Average (since this is an aggregate of all the rows in the DataFrame, and we have no use for it).
6. Remove the * character from any values in the Team column (since it is used to indicate if a team made the playoffs, but we already have that data).
7. Replace Seattle Supersonics with Seattle SuperSonics and Charlotte Hornets with Charlotte Bobcats in the Team column (to account for inconsistent naming of teams in the CSV files).

def clean_csv(path):
    # obtain information needed to clean CSV file
    name = f"{Path(path).parents[0].name}/{Path(path).name}"
    info = files_to_clean[name]
    # read CSV file
    df = pd.read_csv(path, header=info["header"])
    # rename columns
    rename_columns(df, info["header"])
    # remove unnecessary columns
    df = df[info["columns"]]
    # rename remaining columns
    df = df.rename(columns=info["column_mappings"])
    # remove "League Average" row
    df = df[df["Team"] != "League Average"]
    # remove "*" from team names
    df["Team"] = df["Team"].str.replace("*", "", regex=False)
    # make team names consistent
    df["Team"] = df["Team"].str.replace(
        "Seattle Supersonics", "Seattle SuperSonics", regex=False
    )
    df["Team"] = df["Team"].str.replace(
        "Charlotte Hornets", "Charlotte Bobcats", regex=False
    )
    # return DataFrame
    return df
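The team-name cleanup relies on pandas string methods; a tiny standalone illustration on a toy Series (made-up sample values, not the real data):

```python
import pandas as pd

teams = pd.Series(["Dallas Mavericks*", "Seattle Supersonics", "Memphis Grizzlies"])

# strip the playoff marker; regex=False treats "*" as a literal character
teams = teams.str.replace("*", "", regex=False)

# normalize an inconsistently cased franchise name
teams = teams.str.replace("Seattle Supersonics", "Seattle SuperSonics", regex=False)

print(teams.tolist())  # ['Dallas Mavericks', 'Seattle SuperSonics', 'Memphis Grizzlies']
```

Passing regex=False matters here: without it, `*` would be interpreted as a regex quantifier rather than a literal asterisk.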
Now, we need to merge all the cleaned data into a single DataFrame.
For each season, we will do the following:
1. Clean each of the CSV files for the season using the clean_csv function.
2. Merge the DataFrames for the season on the Team column using an outer join.
3. Add a Season column to the beginning of the DataFrame and fill all of the rows with the same season value.

Then, we concatenate the DataFrames for each season into a single DataFrame.
# create main DataFrame
data = pd.DataFrame()
# go through seasons
for season in seasons:
    # initialize season DataFrame
    season_data = None
    # get list of CSV files for season
    infoList = list(itertools.chain(*pages_to_scrape(season).values()))
    pathList = [info["path"] for info in infoList]
    # go through CSV files
    for index, path in enumerate(pathList):
        # clean CSV file and store as DataFrame
        df = clean_csv(path)
        # merge DataFrame with season DataFrame
        if index == 0:
            season_data = df
        else:
            season_data = season_data.merge(df, on="Team", how="outer")
    # add season column at beginning
    season_data.insert(0, "Season", season)
    # add season DataFrame to main DataFrame
    data = pd.concat(objs=[data, season_data])
Now, we can look at the merged data.
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 628 entries, 0 to 29
Data columns (total 48 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Season         628 non-null    int64
 1   Team           628 non-null    object
 2   FG             628 non-null    float64
 3   FGA            628 non-null    float64
 4   FG%            628 non-null    float64
 5   3P             628 non-null    float64
 6   3PA            628 non-null    float64
 7   3P%            628 non-null    float64
 8   2P             628 non-null    float64
 9   2PA            628 non-null    float64
 10  2P%            628 non-null    float64
 11  FT             628 non-null    float64
 12  FTA            628 non-null    float64
 13  FT%            628 non-null    float64
 14  ORB            628 non-null    float64
 15  DRB            628 non-null    float64
 16  TRB            628 non-null    float64
 17  AST            628 non-null    float64
 18  STL            628 non-null    float64
 19  BLK            628 non-null    float64
 20  TOV            628 non-null    float64
 21  PF             628 non-null    float64
 22  PTS            628 non-null    float64
 23  SRS            628 non-null    float64
 24  ORtg           628 non-null    float64
 25  DRtg           628 non-null    float64
 26  NRtg           628 non-null    float64
 27  Pace           628 non-null    float64
 28  FTr            628 non-null    float64
 29  3PAr           628 non-null    float64
 30  TS%            628 non-null    float64
 31  eFG%           628 non-null    float64
 32  TOV%           628 non-null    float64
 33  ORB%           628 non-null    float64
 34  FT/FGA         628 non-null    float64
 35  OppeFG%        628 non-null    float64
 36  OppTOV%        628 non-null    float64
 37  DRB%           628 non-null    float64
 38  OppFT/FGA      628 non-null    float64
 39  Rank           628 non-null    int64
 40  OverallRecord  628 non-null    object
 41  HomeRecord     628 non-null    object
 42  RoadRecord     628 non-null    object
 43  EastRecord     628 non-null    object
 44  WestRecord     628 non-null    object
 45  PlayoffRank    340 non-null    float64
 46  PlayoffW       340 non-null    float64
 47  PlayoffL       340 non-null    float64
dtypes: float64(40), int64(2), object(6)
memory usage: 240.4+ KB
data
| Season | Team | FG | FGA | FG% | 3P | 3PA | 3P% | 2P | 2PA | 2P% | FT | FTA | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | SRS | ORtg | DRtg | NRtg | Pace | FTr | 3PAr | TS% | eFG% | TOV% | ORB% | FT/FGA | OppeFG% | OppTOV% | DRB% | OppFT/FGA | Rank | OverallRecord | HomeRecord | RoadRecord | EastRecord | WestRecord | PlayoffRank | PlayoffW | PlayoffL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003 | Dallas Mavericks | 38.5 | 85.1 | 0.453 | 7.8 | 20.3 | 0.381 | 30.8 | 64.8 | 0.475 | 18.1 | 21.9 | 0.829 | 11.1 | 31.0 | 42.1 | 22.4 | 8.1 | 5.5 | 11.6 | 21.1 | 103.0 | 7.90 | 110.7 | 102.3 | 8.4 | 92.5 | 0.257 | 0.239 | 0.543 | 0.498 | 10.9 | 25.4 | 0.213 | 0.473 | 14.8 | 70.9 | 0.221 | 1 | 60-22 | 33-8 | 27-14 | 26-4 | 34-18 | 8.0 | 10.0 | 10.0 |
| 1 | 2003 | Golden State Warriors | 37.3 | 84.6 | 0.441 | 5.2 | 15.1 | 0.344 | 32.1 | 69.6 | 0.462 | 22.6 | 29.0 | 0.778 | 15.7 | 31.0 | 46.7 | 20.9 | 7.2 | 6.2 | 15.8 | 21.8 | 102.4 | -0.60 | 108.3 | 109.5 | -1.2 | 94.2 | 0.343 | 0.178 | 0.526 | 0.472 | 13.9 | 35.0 | 0.267 | 0.482 | 12.2 | 67.9 | 0.220 | 19 | 38-44 | 24-17 | 14-27 | 19-11 | 19-33 | NaN | NaN | NaN |
| 2 | 2003 | Sacramento Kings | 39.5 | 85.2 | 0.464 | 6.0 | 15.7 | 0.381 | 33.5 | 69.5 | 0.482 | 16.7 | 22.3 | 0.746 | 11.0 | 33.5 | 44.5 | 24.8 | 9.0 | 5.6 | 14.5 | 20.3 | 101.7 | 6.68 | 105.9 | 99.1 | 6.8 | 95.4 | 0.262 | 0.184 | 0.535 | 0.499 | 13.3 | 25.6 | 0.196 | 0.446 | 13.6 | 70.6 | 0.204 | 3 | 59-23 | 35-6 | 24-17 | 23-7 | 36-16 | 3.0 | 7.0 | 5.0 |
| 3 | 2003 | Los Angeles Lakers | 37.7 | 83.6 | 0.451 | 5.9 | 16.7 | 0.356 | 31.8 | 66.9 | 0.475 | 19.0 | 26.0 | 0.734 | 13.1 | 31.1 | 44.3 | 23.3 | 7.8 | 5.7 | 14.5 | 22.9 | 100.4 | 2.71 | 107.2 | 104.7 | 2.5 | 92.5 | 0.311 | 0.199 | 0.528 | 0.486 | 13.3 | 30.2 | 0.228 | 0.477 | 13.4 | 72.7 | 0.241 | 6 | 50-32 | 31-10 | 19-22 | 17-13 | 33-19 | 7.0 | 6.0 | 6.0 |
| 4 | 2003 | Milwaukee Bucks | 37.1 | 81.3 | 0.457 | 7.1 | 18.6 | 0.383 | 30.0 | 62.7 | 0.478 | 18.1 | 23.3 | 0.776 | 10.7 | 28.9 | 39.5 | 22.2 | 7.6 | 4.2 | 12.7 | 22.2 | 99.5 | -0.24 | 108.8 | 108.6 | 0.2 | 90.4 | 0.287 | 0.229 | 0.543 | 0.500 | 12.2 | 25.6 | 0.222 | 0.494 | 13.5 | 69.9 | 0.237 | 16 | 42-40 | 25-16 | 17-24 | 32-22 | 10-18 | 11.0 | 2.0 | 4.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25 | 2023 | Orlando Magic | 40.5 | 86.3 | 0.470 | 10.8 | 31.1 | 0.346 | 29.8 | 55.2 | 0.539 | 19.6 | 25.0 | 0.784 | 10.2 | 33.1 | 43.2 | 23.2 | 7.4 | 4.7 | 15.1 | 20.1 | 111.4 | -2.39 | 111.6 | 114.2 | -2.6 | 99.3 | 0.290 | 0.361 | 0.573 | 0.532 | 13.4 | 23.8 | 0.227 | 0.550 | 13.1 | 77.7 | 0.211 | 25 | 34-48 | 20-21 | 14-27 | 20-32 | 14-16 | NaN | NaN | NaN |
| 26 | 2023 | Charlotte Bobcats | 41.3 | 90.4 | 0.457 | 10.7 | 32.5 | 0.330 | 30.5 | 57.9 | 0.528 | 17.6 | 23.6 | 0.749 | 11.0 | 33.5 | 44.5 | 25.1 | 7.7 | 5.2 | 14.2 | 20.3 | 111.0 | -5.89 | 109.2 | 115.3 | -6.1 | 100.8 | 0.261 | 0.360 | 0.550 | 0.516 | 12.3 | 23.8 | 0.195 | 0.544 | 12.5 | 75.5 | 0.211 | 27 | 27-55 | 13-28 | 14-27 | 15-37 | 12-18 | NaN | NaN | NaN |
| 27 | 2023 | Houston Rockets | 40.6 | 88.9 | 0.457 | 10.4 | 31.9 | 0.327 | 30.2 | 56.9 | 0.530 | 19.1 | 25.3 | 0.754 | 13.4 | 32.9 | 46.3 | 22.4 | 7.3 | 4.6 | 16.2 | 20.5 | 110.7 | -7.62 | 111.4 | 119.3 | -7.9 | 99.0 | 0.285 | 0.359 | 0.554 | 0.516 | 14.0 | 30.2 | 0.215 | 0.564 | 11.8 | 75.8 | 0.218 | 28 | 22-60 | 14-27 | 8-33 | 10-20 | 12-40 | NaN | NaN | NaN |
| 28 | 2023 | Detroit Pistons | 39.6 | 87.1 | 0.454 | 11.4 | 32.4 | 0.351 | 28.2 | 54.6 | 0.516 | 19.8 | 25.7 | 0.771 | 11.2 | 31.3 | 42.4 | 23.0 | 7.0 | 3.8 | 15.1 | 22.1 | 110.3 | -7.73 | 110.7 | 118.9 | -8.2 | 99.0 | 0.295 | 0.372 | 0.561 | 0.520 | 13.3 | 24.9 | 0.227 | 0.557 | 11.9 | 74.0 | 0.231 | 30 | 17-65 | 9-32 | 8-33 | 8-44 | 9-21 | NaN | NaN | NaN |
| 29 | 2023 | Miami Heat | 39.2 | 85.3 | 0.460 | 12.0 | 34.8 | 0.344 | 27.3 | 50.5 | 0.540 | 19.1 | 23.0 | 0.831 | 9.7 | 30.9 | 40.6 | 23.8 | 8.0 | 3.0 | 13.5 | 18.5 | 109.5 | -0.13 | 113.0 | 113.3 | -0.3 | 96.3 | 0.270 | 0.408 | 0.574 | 0.530 | 12.4 | 22.8 | 0.224 | 0.561 | 14.5 | 77.7 | 0.198 | 13 | 44-38 | 27-14 | 17-24 | 24-28 | 20-10 | 4.0 | 3.0 | 1.0 |
628 rows × 48 columns
We want to analyze teams' level of success in the playoffs, so it only makes sense to consider teams that made the playoffs in our analysis.
Thus, we will drop rows where the PlayoffRank, PlayoffW, or PlayoffL columns are NaN.
We will also drop rows where the sum of PlayoffW and PlayoffL is $0$ (which would indicate that a team made it to the play-in tournament, but did not advance to the playoffs).
data = data.dropna(subset=["PlayoffRank", "PlayoffW", "PlayoffL"])
data = data[(data["PlayoffW"] != 0) | (data["PlayoffL"] != 0)]
data = data.reset_index(drop=True)
As we can see, most of the columns in the DataFrame have the correct dtypes, but there are a few we need to correct. Specifically, PlayoffRank, PlayoffW, and PlayoffL need to be changed from float64 to int64.
data = data.astype({"PlayoffRank": "int64", "PlayoffW": "int64", "PlayoffL": "int64"})
Now, using our existing data, we want to add a few new columns to our DataFrame that will be useful in analysis.
One important column the data is missing is Conference. It will be useful to know which conference a team is in, because that determines playoff matchups.
The EastRecord and WestRecord columns are in the form W-L (where W is the number of wins and L is the number of losses), indicating a team's record against opponents in each conference. We can calculate the number of games a team played in each conference using the formula below:

$$ \text{games played} = \text{wins} + \text{losses} $$
Then, we can determine a team's Conference by checking if the team played more games in the East or the West.
def get_conference(row):
    east_wins, east_losses = row["EastRecord"].split("-")
    west_wins, west_losses = row["WestRecord"].split("-")
    east_games = int(east_wins) + int(east_losses)
    west_games = int(west_wins) + int(west_losses)
    return "East" if east_games > west_games else "West"

data.insert(2, "Conference", data.apply(get_conference, axis=1))
Another important column the data is missing is Seed. Playoff seeding is likely to be important because it determines the matchups: as you can imagine, the $1^\text{st}$ seed is typically expected to win more games than the $8^\text{th}$ seed.
We can determine the playoff seeding by doing the following:
1. Group the data by Season and then Conference.
2. Rank the teams by Rank within each group.

data.insert(
    3,
    "Seed",
    data.groupby(["Season", "Conference"])["Rank"]
    .rank(method="dense", ascending=True)
    .astype("int64"),
)
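The seeding logic boils down to a grouped dense rank. A toy example with made-up ranks for one season makes the behavior concrete:

```python
import pandas as pd

# made-up league-wide ranks for two conferences in one season
df = pd.DataFrame({
    "Season": [2003] * 4,
    "Conference": ["East", "West", "East", "West"],
    "Rank": [5, 1, 2, 9],
})

# rank within each (Season, Conference) group; dense → 1, 2, 3, ... with no gaps
df["Seed"] = (
    df.groupby(["Season", "Conference"])["Rank"]
    .rank(method="dense", ascending=True)
    .astype("int64")
)
print(df["Seed"].tolist())  # [2, 1, 1, 2]
```

The East team ranked 5th league-wide becomes the 2nd seed in the East, while the West team ranked 9th becomes the 2nd seed in the West: the league-wide rank order is preserved, but restarted within each conference.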
Currently, we have $5$ columns that have team record in the form of W-L (where W is the number of wins and L is the number of losses): OverallRecord, HomeRecord, RoadRecord, EastRecord, and WestRecord.
We want to standardize these columns by turning them into winning percentages using the formula below:
$$ \text{winning percentage} = \frac{\text{wins}}{\text{wins} + \text{losses}} $$

This will give us $5$ new columns to replace the existing ones: W%, HomeW%, RoadW%, EastW%, and WestW%.
def record_to_win_pct(record):
    wins, losses = record.split("-")
    return int(wins) / (int(wins) + int(losses))

for (win_pct, record) in [
    ("W%", "OverallRecord"),
    ("HomeW%", "HomeRecord"),
    ("RoadW%", "RoadRecord"),
    ("EastW%", "EastRecord"),
    ("WestW%", "WestRecord"),
]:
    data[record] = data[record].apply(record_to_win_pct)
    data = data.rename(columns={record: win_pct})
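As a quick sanity check of the conversion, the Mavericks' 60-22 record from the table above (the function is restated so the snippet stands alone):

```python
def record_to_win_pct(record):
    # "W-L" string → wins / (wins + losses)
    wins, losses = record.split("-")
    return int(wins) / (int(wins) + int(losses))

print(round(record_to_win_pct("60-22"), 6))  # 0.731707
```

This matches the W% value of 0.731707 shown for Dallas in the cleaned data below.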
Now that we know each team's Conference as well as its winning percentage against teams from each conference, let's add a column named ConferenceW%, which is a team's winning percentage against teams in its own conference.
data.insert(
    len(data.columns) - 3,
    "ConferenceW%",
    data.apply(
        lambda row: row["EastW%"] if row["Conference"] == "East" else row["WestW%"],
        axis=1,
    ),
)
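The row-wise apply is easy to read; for larger frames, an equivalent vectorized form with numpy.where would behave the same. A sketch on made-up toy data:

```python
import numpy as np
import pandas as pd

# toy stand-in for the real data: one East team, one West team
df = pd.DataFrame({
    "Conference": ["East", "West"],
    "EastW%": [0.60, 0.40],
    "WestW%": [0.35, 0.65],
})

# pick EastW% for East teams and WestW% for West teams, without apply
df["ConferenceW%"] = np.where(df["Conference"] == "East", df["EastW%"], df["WestW%"])
print(df["ConferenceW%"].tolist())  # [0.6, 0.65]
```

At 336 rows the difference is negligible, so the apply version's readability wins out here.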
Since there are so many columns in our DataFrame, it can be difficult to see what each one means. For this reason, we will create a dictionary named data_labels mapping each column name to a label describing it. We can use these labels later on (e.g. in our plots).
data_labels = {
    "Season": "Season",
    "Team": "Team",
    "Conference": "Conference",
    "Seed": "Seed",
    "FG": "Field Goals (FG)",
    "FGA": "Field Goal Attempts (FGA)",
    "FG%": "Field Goal Percentage (FG%)",
    "3P": "3-Point Field Goals (3P)",
    "3PA": "3-Point Field Goal Attempts (3PA)",
    "3P%": "3-Point Field Goal Percentage (3P%)",
    "2P": "2-Point Field Goals (2P)",
    "2PA": "2-Point Field Goal Attempts (2PA)",
    "2P%": "2-Point Field Goal Percentage (2P%)",
    "FT": "Free Throws (FT)",
    "FTA": "Free Throw Attempts (FTA)",
    "FT%": "Free Throw Percentage (FT%)",
    "ORB": "Offensive Rebounds (ORB)",
    "DRB": "Defensive Rebounds (DRB)",
    "TRB": "Total Rebounds (TRB)",
    "AST": "Assists (AST)",
    "STL": "Steals (STL)",
    "BLK": "Blocks (BLK)",
    "TOV": "Turnovers (TOV)",
    "PF": "Personal Fouls (PF)",
    "PTS": "Points (PTS)",
    "SRS": "Simple Rating System (SRS)",
    "ORtg": "Offensive Rating (ORtg)",
    "DRtg": "Defensive Rating (DRtg)",
    "NRtg": "Net Rating (NRtg)",
    "Pace": "Pace Factor (Pace)",
    "FTr": "Free Throw Attempt Rate (FTr)",
    "3PAr": "3-Point Attempt Rate (3PAr)",
    "TS%": "True Shooting Percentage (TS%)",
    "eFG%": "Effective Field Goal Percentage (eFG%)",
    "TOV%": "Turnover Percentage (TOV%)",
    "ORB%": "Offensive Rebound Percentage (ORB%)",
    "FT/FGA": "Free Throws Per Field Goal Attempt (FT/FGA)",
    "OppeFG%": "Opponent Effective Field Goal Percentage (Opp. eFG%)",
    "OppTOV%": "Opponent Turnover Percentage (Opp. TOV%)",
    "DRB%": "Defensive Rebound Percentage (DRB%)",
    "OppFT/FGA": "Opponent Free Throws Per Field Goal Attempt (Opp. FT/FGA)",
    "Rank": "Rank",
    "W%": "Winning Percentage",
    "HomeW%": "Home Winning Percentage",
    "RoadW%": "Road Winning Percentage",
    "EastW%": "East Winning Percentage",
    "WestW%": "West Winning Percentage",
    "ConferenceW%": "Conference Winning Percentage",
    "PlayoffRank": "Playoff Rank",
    "PlayoffW": "Playoff Wins",
    "PlayoffL": "Playoff Losses",
}
Now, our data should be nice and tidy, making it suitable for analysis. Let's take a quick look before moving on.
data
| Season | Team | Conference | Seed | FG | FGA | FG% | 3P | 3PA | 3P% | 2P | 2PA | 2P% | FT | FTA | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | SRS | ORtg | DRtg | NRtg | Pace | FTr | 3PAr | TS% | eFG% | TOV% | ORB% | FT/FGA | OppeFG% | OppTOV% | DRB% | OppFT/FGA | Rank | W% | HomeW% | RoadW% | EastW% | WestW% | ConferenceW% | PlayoffRank | PlayoffW | PlayoffL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003 | Dallas Mavericks | West | 1 | 38.5 | 85.1 | 0.453 | 7.8 | 20.3 | 0.381 | 30.8 | 64.8 | 0.475 | 18.1 | 21.9 | 0.829 | 11.1 | 31.0 | 42.1 | 22.4 | 8.1 | 5.5 | 11.6 | 21.1 | 103.0 | 7.90 | 110.7 | 102.3 | 8.4 | 92.5 | 0.257 | 0.239 | 0.543 | 0.498 | 10.9 | 25.4 | 0.213 | 0.473 | 14.8 | 70.9 | 0.221 | 1 | 0.731707 | 0.804878 | 0.658537 | 0.866667 | 0.653846 | 0.653846 | 8 | 10 | 10 |
| 1 | 2003 | Sacramento Kings | West | 3 | 39.5 | 85.2 | 0.464 | 6.0 | 15.7 | 0.381 | 33.5 | 69.5 | 0.482 | 16.7 | 22.3 | 0.746 | 11.0 | 33.5 | 44.5 | 24.8 | 9.0 | 5.6 | 14.5 | 20.3 | 101.7 | 6.68 | 105.9 | 99.1 | 6.8 | 95.4 | 0.262 | 0.184 | 0.535 | 0.499 | 13.3 | 25.6 | 0.196 | 0.446 | 13.6 | 70.6 | 0.204 | 3 | 0.719512 | 0.853659 | 0.585366 | 0.766667 | 0.692308 | 0.692308 | 3 | 7 | 5 |
| 2 | 2003 | Los Angeles Lakers | West | 5 | 37.7 | 83.6 | 0.451 | 5.9 | 16.7 | 0.356 | 31.8 | 66.9 | 0.475 | 19.0 | 26.0 | 0.734 | 13.1 | 31.1 | 44.3 | 23.3 | 7.8 | 5.7 | 14.5 | 22.9 | 100.4 | 2.71 | 107.2 | 104.7 | 2.5 | 92.5 | 0.311 | 0.199 | 0.528 | 0.486 | 13.3 | 30.2 | 0.228 | 0.477 | 13.4 | 72.7 | 0.241 | 6 | 0.609756 | 0.756098 | 0.463415 | 0.566667 | 0.634615 | 0.634615 | 7 | 6 | 6 |
| 3 | 2003 | Milwaukee Bucks | East | 7 | 37.1 | 81.3 | 0.457 | 7.1 | 18.6 | 0.383 | 30.0 | 62.7 | 0.478 | 18.1 | 23.3 | 0.776 | 10.7 | 28.9 | 39.5 | 22.2 | 7.6 | 4.2 | 12.7 | 22.2 | 99.5 | -0.24 | 108.8 | 108.6 | 0.2 | 90.4 | 0.287 | 0.229 | 0.543 | 0.500 | 12.2 | 25.6 | 0.222 | 0.494 | 13.5 | 69.9 | 0.237 | 16 | 0.512195 | 0.609756 | 0.414634 | 0.592593 | 0.357143 | 0.592593 | 11 | 2 | 4 |
| 4 | 2003 | Orlando Magic | East | 8 | 35.9 | 82.5 | 0.436 | 6.9 | 19.4 | 0.357 | 29.0 | 63.1 | 0.460 | 19.7 | 25.4 | 0.777 | 11.7 | 29.2 | 40.9 | 20.4 | 8.5 | 3.7 | 14.4 | 23.0 | 98.5 | -0.39 | 105.2 | 105.0 | 0.2 | 93.1 | 0.307 | 0.235 | 0.526 | 0.478 | 13.3 | 27.0 | 0.239 | 0.486 | 15.1 | 71.1 | 0.250 | 17 | 0.512195 | 0.634146 | 0.390244 | 0.574074 | 0.392857 | 0.574074 | 15 | 3 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 331 | 2023 | Phoenix Suns | West | 4 | 42.1 | 90.1 | 0.467 | 12.2 | 32.6 | 0.374 | 29.9 | 57.5 | 0.520 | 17.2 | 21.7 | 0.793 | 11.8 | 32.4 | 44.2 | 27.3 | 7.1 | 5.3 | 13.5 | 21.2 | 113.6 | 2.08 | 115.1 | 113.0 | 2.1 | 98.2 | 0.241 | 0.362 | 0.570 | 0.535 | 12.0 | 26.6 | 0.191 | 0.532 | 12.9 | 76.0 | 0.234 | 10 | 0.548780 | 0.682927 | 0.414634 | 0.500000 | 0.576923 | 0.576923 | 3 | 4 | 1 |
| 332 | 2023 | Los Angeles Clippers | West | 6 | 41.1 | 86.1 | 0.477 | 12.7 | 33.4 | 0.381 | 28.4 | 52.7 | 0.539 | 18.7 | 23.9 | 0.781 | 9.8 | 33.4 | 43.2 | 23.9 | 7.1 | 4.4 | 14.2 | 19.5 | 113.6 | 0.31 | 115.0 | 114.5 | 0.5 | 98.0 | 0.278 | 0.387 | 0.588 | 0.551 | 12.8 | 22.9 | 0.217 | 0.543 | 11.7 | 76.6 | 0.195 | 12 | 0.536585 | 0.560976 | 0.512195 | 0.566667 | 0.519231 | 0.519231 | 14 | 1 | 4 |
| 333 | 2023 | Brooklyn Nets | East | 6 | 41.5 | 85.1 | 0.487 | 12.8 | 33.8 | 0.378 | 28.7 | 51.3 | 0.559 | 17.7 | 22.1 | 0.800 | 8.2 | 32.3 | 40.5 | 25.5 | 7.1 | 6.2 | 13.7 | 21.1 | 113.4 | 1.03 | 115.0 | 114.1 | 0.9 | 98.3 | 0.260 | 0.397 | 0.598 | 0.562 | 12.7 | 19.6 | 0.208 | 0.530 | 12.2 | 73.7 | 0.212 | 9 | 0.548780 | 0.560976 | 0.536585 | 0.576923 | 0.500000 | 0.576923 | 16 | 0 | 4 |
| 334 | 2023 | Cleveland Cavaliers | East | 4 | 41.6 | 85.2 | 0.488 | 11.6 | 31.6 | 0.367 | 30.0 | 53.6 | 0.559 | 17.5 | 22.5 | 0.780 | 9.7 | 31.4 | 41.1 | 24.9 | 7.1 | 4.7 | 13.3 | 19.0 | 112.3 | 5.23 | 116.1 | 110.6 | 5.5 | 95.7 | 0.264 | 0.371 | 0.590 | 0.556 | 12.3 | 23.6 | 0.206 | 0.535 | 14.4 | 76.3 | 0.210 | 5 | 0.621951 | 0.756098 | 0.487805 | 0.653846 | 0.566667 | 0.653846 | 10 | 1 | 3 |
| 335 | 2023 | Miami Heat | East | 7 | 39.2 | 85.3 | 0.460 | 12.0 | 34.8 | 0.344 | 27.3 | 50.5 | 0.540 | 19.1 | 23.0 | 0.831 | 9.7 | 30.9 | 40.6 | 23.8 | 8.0 | 3.0 | 13.5 | 18.5 | 109.5 | -0.13 | 113.0 | 113.3 | -0.3 | 96.3 | 0.270 | 0.408 | 0.574 | 0.530 | 12.4 | 22.8 | 0.224 | 0.561 | 14.5 | 77.7 | 0.198 | 13 | 0.536585 | 0.658537 | 0.414634 | 0.461538 | 0.666667 | 0.461538 | 4 | 3 | 1 |
336 rows × 51 columns
Now, it is time to explore some of the trends in this data through data visualization. We will be making extensive use of Plotly for data visualization.
Before we start creating visualizations, there are a few things we need to set up.
Since the $2023$ playoffs are still ongoing, let's split our data into $2$ DataFrames: past_data and current_data. We will only use past_data for now.
past_data = data[data["Season"] < 2023]
current_data = data[data["Season"] == 2023]
Since there are lots of columns in our DataFrame, we will be creating lots of graphs that make use of subplots.
For convenience, we will define a function named get_subplots.
It takes in:
- df: a DataFrame with our data
- labels: a dictionary mapping column names to descriptions to use as labels
- X_names: a list of columns we want to use as the $x$-axis of a subplot
- y_name: the name of the column we want to use as the $y$-axis for every subplot
- rows: the number of rows in the overall figure
- cols: the number of columns in the overall figure

It will return a $\text{rows} \times \text{cols}$ figure with plots of y_name vs. x_name for every x_name in X_names using the data in df.
We will also add a regression line to each subplot using the linear regression (ordinary least squares) model from scikit-learn.
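The regression-line computation inside get_subplots boils down to a one-feature ordinary least squares fit. A minimal standalone version on made-up, perfectly linear data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up data following y = 1 + 2x exactly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# fit ordinary least squares: intercept_ and coef_ give the line's equation
model = LinearRegression()
model.fit(X, y)

print(f"y = {model.intercept_:.2f} + {model.coef_[0]:.2f}x")  # y = 1.00 + 2.00x
print(round(model.score(X, y), 2))  # 1.0 (R^2, the coefficient of determination)
```

In get_subplots, the same fit is evaluated over a linspace of $x$ values to draw the line, and score() provides the $r^2$ shown in the hover text.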
def get_subplots(df, labels, X_names, y_name, rows, cols):
    # create subplot titles
    titles = []
    width = 30
    for x_name in X_names:
        title = f"{labels[y_name]} vs. {labels[x_name]}"
        title = "<br>".join(textwrap.wrap(title, width=width))
        titles.append(title)
    # create figure with subplots
    fig = make_subplots(rows=rows, cols=cols, subplot_titles=titles)
    # go through each subplot
    for (index, x_name) in enumerate(X_names):
        # get row and column of subplot
        row = index // cols + 1
        col = index % cols + 1
        # create hovertemplate for subplot
        hovertemplate = "<br>".join(
            [
                "%{xaxis.title.text}=%{x}",
                "%{yaxis.title.text}=%{y}",
                "Team=%{customdata[0]}",
                "Season=%{customdata[1]}",
            ]
        )
        # create subplot
        subplot = go.Scatter(
            x=df[x_name],
            y=df[y_name],
            mode="markers",
            customdata=df[["Team", "Season"]],
            hovertemplate=hovertemplate,
            name="",
            showlegend=False,
        )
        # add subplot to figure
        fig.add_trace(subplot, row, col)
        fig.update_xaxes(title_text=labels[x_name], row=row, col=col)
        fig.update_yaxes(title_text=labels[y_name], row=row, col=col)
        # compute regression line for subplot
        model = LinearRegression()
        X = df[[x_name]].values
        y = df[y_name]
        model.fit(X, y)
        equation = f"y = {model.intercept_:.2f} + {model.coef_[0]:.2f}x"
        score = model.score(X, y)
        x_range = np.linspace(df[x_name].min(), df[x_name].max(), 100)
        y_range = model.predict(x_range.reshape(-1, 1))
        # create regression line for subplot
        line = go.Scatter(
            x=x_range,
            y=y_range,
            mode="lines",
            hovertemplate=f"{equation}<br>r^2 = {score:.2f}",
            name="Regression Line",
            showlegend=False,
            line={"color": "black"},
        )
        # add regression line for subplot to figure
        fig.add_trace(line, row, col)
    # update figure layout
    fig.update_layout(height=rows * 400)
    # return figure
    return fig
As we can see, there are quite a lot of statistics in the DataFrame, so we will group them into the following categories:

- Overall Rating
- Offensive Volume
- Offensive Efficiency
- Rebounding
- Passing/Turnovers
- Defense
Our goal is to see which statistics in each category are most correlated with playoff wins. Are there any statistics that individually work as decent predictors of playoff wins (even with just a simple linear regression model)? Or do we need to look at statistics together to make any sense of the data?
We will be making scatter plots of Playoff Wins vs. each statistic in each category, with regression lines added. This is just meant for preliminary analysis, as linear regression is probably overly simplistic for most of these statistics. However, these plots might give us some helpful intuition for understanding the data.
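To complement the visual inspection, we could also rank statistics numerically by their correlation with playoff wins. A hedged sketch on made-up toy data (the real call would use past_data and the category lists below):

```python
import pandas as pd

# toy stand-in for past_data: two candidate stats and the target column
df = pd.DataFrame({
    "SRS": [7.9, 6.7, 2.7, -0.2],
    "TOV": [11.6, 14.5, 14.5, 12.7],
    "PlayoffW": [10, 7, 6, 2],
})

# Pearson correlation of each candidate stat with PlayoffW, strongest first
correlations = df[["SRS", "TOV"]].corrwith(df["PlayoffW"]).sort_values(ascending=False)
print(correlations.index[0])  # SRS — the stat most correlated with playoff wins
```

Correlation only captures linear association, of course, which is exactly the limitation noted above; the scatter plots let us see where that assumption breaks down.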
overall_rating_stats = ["Rank", "Seed", "SRS", "NRtg", "W%", "HomeW%", "RoadW%", "EastW%", "WestW%", "ConferenceW%"]
fig = get_subplots(past_data, data_labels, overall_rating_stats, "PlayoffW", 4, 3)
fig.show()
offensive_volume_stats = ["FG", "3P", "2P", "FT", "PTS"]
fig = get_subplots(past_data, data_labels, offensive_volume_stats, "PlayoffW", 2, 3)
fig.show()
offensive_efficiency_stats = ["FG%", "3P%", "2P%", "FT%", "TS%", "eFG%", "FTr", "3PAr", "FT/FGA", "Pace", "ORtg"]
fig = get_subplots(past_data, data_labels, offensive_efficiency_stats, "PlayoffW", 4, 3)
fig.show()
rebounding_stats = ["TRB", "ORB", "DRB", "ORB%", "DRB%"]
fig = get_subplots(past_data, data_labels, rebounding_stats, "PlayoffW", 2, 3)
fig.show()
passing_turnovers_stats = ["AST", "TOV", "TOV%"]
fig = get_subplots(past_data, data_labels, passing_turnovers_stats, "PlayoffW", 1, 3)
fig.show()
defense_stats = ["STL", "BLK", "PF", "OppeFG%", "OppTOV%", "OppFT/FGA", "DRtg"]
fig = get_subplots(past_data, data_labels, defense_stats, "PlayoffW", 3, 3)
fig.show()